WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for 
genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its 
beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now 
represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for 
around 20 nematodes. In this article, we focus on WormBase's role of genome sequence annotation, describing how we 
annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches 
to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as 
modENCODE. 



Introduction 

WormBase seeks to present an integrative view of nematode 
biology by in-depth curation of the research on C. elegans and 
other members of this animal family. To this end we integrate 
genomic sequences and annotations with curated data from 
genetic, developmental, physiological, behavioral and evolutionary 
studies. We provide multiple streams of access to the data, 
including the main website portal (www.wormbase.org), genome 
browsers, sequence search services, and application pro- 
gramming interfaces. WormBase aims to be the central repository 
and portal for nematode genomic data. 

The activities of the WormBase consortium can be broadly 
classified into three groups: (1) curation of C. elegans literature 
and associated research and development; (2) user interface 
design, development and maintenance and (3) genome sequence 
annotation, analysis and comparative genomics. The volume of 
nematode data has exploded in recent years, and WormBase has 
had to respond accordingly in all three of these areas. For 
example, as the volume and variety of information has increased, 
its presentation to the community in a clear and accessible way 
requires new forms of display. We have responded to this 
challenge by completely redesigning the WormBase web-interface 
(Harris et al., manuscript in preparation). In this article, we focus 
on our remit to provide integrated, coherent genome annotation for 
a large (and growing) collection of nematode genome sequences and 
strains. We also summarize our release production cycle and analysis 
pipelines, and describe how they affect the timeline between data 
submission and its subsequent public release. 

Integrating and Annotating Multiple 
Nematode Genomes 

WormBase now hosts genomic data for nearly 20 nematodes 
(see Table 1, and refs. 3-14), representing species of evolutionary, 
biomedical and agricultural interest. Recent additions include 
the parasitic nematodes Trichinella spiralis, Ascaris suum and 
Bursaphelenchus xylophilus. The maturity of genome sequence 
and annotation in WormBase varies widely between species. At 
one end of the spectrum is the C. elegans genome, which was 
completed over a number of years using traditional physical 
mapping and clone-by-clone sequencing and finishing, and 
which has highly curated annotation. More recently we have 
seen a number of genome sequences generated by new high- 
throughput low-cost technologies and many of these genomes 
are inevitably fragmented and incomplete; additionally, there is 
relatively little published functional information about many of 
these species. 

WormBase undertakes different responsibilities for each of 
these species, which can include (1) administration of the genome 
sequence; (2) curation of gene models and other sequence 
features; (3) curation of non-sequence-based data from the 
literature and (4) tracking of identifiers forward through different 
versions of the genome sequence and annotation. The specific way 
in which we manage the data for a species depends (primarily) 
on whether we curate gene models and other features for it. It is 
therefore useful for the sake of discussion to classify the species 
into two groups: core (WormBase curated gene models) and 
non-core. As of release WS230, the core species are C. elegans, 
C. briggsae, C. remanei, C. brenneri and C. japonica. 

Analyzing and presenting data for an ever-increasing number 
of nematode genomes requires methods that scale well. We deploy 
a standard automatic analysis pipeline to annotate all the species 
we house (core and non-core), including repeat prediction, cDNA 
alignments, the determination of homology relationships, and 
protein domain identification. If a genome sequence for a non- 
core species is submitted without a gene-set, we also run an 
in-house gene prediction pipeline that uses CEGMA to accurately 
identify a small, universally conserved set of gene models. These 
are then used to train parameters for AUGUSTUS, which we 
then apply using protein homologies and any available RNASeq 
and other transcript data as supporting evidence. In some cases, 
these internally-produced gene predictions are later replaced by a 
canonical set of models provided by the submitters. 

Updating an existing species in WormBase with a new 
assembly and/or gene-set presents additional challenges, because 
users rely on stable identifiers to track their entities of interest, 
which must be propagated forward to corresponding features 
in subsequent releases. For core species, identifiers are actively 
managed and tracked using our own curation software infrastruc- 
ture. For non-core species, we use the Ensembl stable-identifier 
mapping software for this task. 

The principal way in which we draw information from multiple 
species together is by connecting genes via orthology and paralogy 
relationships to genes in other species (both nematode and other 
model organisms such as human, mouse and fly). As of WS230, 
we include relationships published by the following projects and 
resources: InParanoid (version 7); TreeFam (version 7); the 
Orthologous Matrix Project (OMA, August 2009 version); 
OrthoMCL; PantherDB (version 7); and Ensembl 
(version 65). In addition, we curate orthology calls from the 
literature (e.g., Hillier et al., ref 8) and direct submissions. We 
also use data in eggNOG (version 3.0) to cluster genes into 
functionally characterized homologous groups. 

These resources are inevitably based on snapshots of the gene 
models, taken at various times. For our core species, however, 
particularly C. elegans, the gene models are in a state of flux, being 
revised and improved on the basis of the latest evidence. In order 
to infer up-to-date nematode homology relationships for the 
latest gene models, we run the Ensembl Compara GeneTree 
pipeline as part of the preparation for every WormBase release. 
The resulting gene trees are used to infer additional current 
orthology relationships to those obtained by import from the 
third-party resources and direct submission. 



One way in which we use the orthology relationships internally 
is to project WormBase-approved gene names onto orthologous 
gene(s) of other nematode species. For this a conservative 
approach is adopted: each proposed gene name is required to be 
supported by an unambiguous one-to-one orthology connection 
according to the majority of available source analyses. 
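
The majority rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not WormBase's implementation: the function name, the gene identifiers, and the convention that a source without an unambiguous one-to-one call is recorded as None are all assumptions made for the example, as is the choice to count such sources in the denominator.

```python
from collections import Counter

def project_gene_name(one_to_one_calls):
    """Return the target-species gene that should receive a projected name.

    one_to_one_calls maps each source analysis to the single target gene it
    reports as a one-to-one ortholog, or to None when that source reports no
    unambiguous one-to-one relationship (hypothetical input convention).
    """
    # Count votes only from sources that made an unambiguous call.
    votes = Counter(t for t in one_to_one_calls.values() if t is not None)
    if not votes:
        return None
    target, support = votes.most_common(1)[0]
    # Assumption: require a strict majority of all sources consulted.
    return target if support * 2 > len(one_to_one_calls) else None

# Hypothetical example: three of five sources agree on the same ortholog.
calls = {
    "InParanoid": "CBG00001",
    "TreeFam": "CBG00001",
    "OMA": None,
    "Compara": "CBG00001",
    "OrthoMCL": "CBG00002",
}
projected = project_gene_name(calls)
```

With a tie between two sources, no name is projected, which matches the conservative intent of the rule.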

We also use the Ensembl Compara DNA pipeline to produce 
whole-genome multiple alignments of all genomes in WormBase 
and derived genome conservation tracks (using GERP). 
However, as the genetic diversity of the species collection in 
WormBase continues to increase, a single multiple alignment for 
all nematodes becomes less appropriate. We therefore propose 
to replace it with a series of pairwise alignments, providing 
multiple alignments only for selected subsets of species. 

Sequence Curation 

WormBase adopts an anomaly-driven approach to curation, 
whereby discrepancies between current gene models and align- 
ment data are identified and flagged as curation targets. We have 
implemented a software application (CurationTool) that identi- 
fies these discrepancies and scores them according to their degree 
of discordance, presenting the results to the curator using a 
graphical user interface. An in-depth discussion of CurationTool 
and our anomaly-driven curation is presented elsewhere. 

For protein-coding genes, WormBase curates only the protein- 
coding portion (CDS) of the full transcript. For our core species, 
we use the high-confidence subset of cDNA alignments over- 
laying the curated CDS models to infer a set of full-length 
transcripts (including 5' and 3' untranslated regions), using a 
custom algorithm (unpublished). In the past, the accuracy of this 
process has been sensitive to artifacts such as alignment errors or 
chimeric cDNAs, but we have recently improved the algorithm to 
take these factors into account. 

The primary line of evidence for gene model curation is 
transcript data. In addition to cDNAs deposited in the nucleotide 
archives, we draw data from numerous resources, publications 
and direct submissions. We also align all RNASeq data deposited 
in the Short Read Archive (SRA) to our core species using 
TopHat, and infer gene expression estimates for a variety of life 
stages and environmental conditions using Cufflinks. 

WormBase is committed to acting as the ultimate repository for 
data coming from the nematode half of the modENCODE 
project. Most data sets have been accessible via the genome 
browser since the summer of 2010. To extract the maximum 
utility from the data, we integrate it fully into our database by 
extending the data models where necessary and adding full cross- 
referencing and connectivity with existing WormBase objects. 
To date, the focus for full integration has been on data sets with 
high impact on gene model and other sequence feature curation, 
namely: trans-splice sites; poly-A cleavage sites and untranslated 
regions; large-scale EST sets (P. Green; data retrieved from 
nucleotide archives); mass-spectrometry peptide sequences; and 
RNASeq transcripts and derived gene predictions. 

The data of highest impact for curation has been the RNASeq 
transcriptome, and this has been used in a number of different 
ways. First, the modENCODE "genelets" (fragmentary gene 
models constructed using RNASeq data from 14 life stages) have 
been used to produce a new anomaly type for CurationTool that 
highlights potential cases where adjacent genes could be merged. 
To date, over three hundred cases displaying this anomaly have 
been scrutinized, of which approximately 35% resulted in a 
merge, and a further 10% some other change (for example the 
movement of an exon from one gene to another). Second, we 
have re-visited the source RNASeq data and analyzed it using 
the TopHat/Cufflinks pipeline to identify candidate "RNASeq- 
splice" features. These can be used both to confirm introns 
already part of curated gene models, and also to suggest changes 
to existing gene models or new isoforms. Third, the strand bias 
characteristic of the modENCODE RNASeq alignments has 
been extremely useful for curators to resolve ambiguities in 
the definition of the 5' and 3' ends of genes. Finally, the 
modENCODE RNASeq data has allowed us to make corrections 
to the C. elegans reference genome itself. By taking proposed 
errors and verifying them using data from a private submission 
of high-throughput-sequencing (J. Ahringer and M. Berriman, 
pers. comm.), we have been able to make 156 genome sequence 
corrections (110 insertions, 44 deletions and 2 substitutions), 
resulting in the correction of 100 gene models. 

Additionally, since the data from modENCODE began to 
become available from the project Data Co-ordination Centre, 
the following data sets have been subjected to rigorous internal 
quality control and fully integrated into the database: ~300 
Highly Occupied Target (HOT) regions; ~7,000 non-coding 
RNA genes; the probable parent for ~1,000 pseudogenes; and 
~21,000 three-prime UTRs from the UTRome project. We will 
prioritise the incorporation of the transcription-factor binding 
site and chromatin accessibility data as soon as the final versions 
of these data sets are made available. 

We have also worked with groups performing their own 
analysis of the modENCODE data. For example, a study of the 
modENCODE RNASeq reads (T. Blumenthal, pers. comm.) has 
resulted in significant improvements to the operon data set. This 
has involved identifying cases where fewer than 5% of the trans- 
splice leader reads for "internal" genes (i.e., genes other than the 
first) were SL2 type, and modifying the gene content of the 
operons accordingly. 
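
As a rough illustration of the 5% criterion, the following Python sketch flags internal operon genes whose SL2 read fraction falls below the threshold. The function name, the input layout and the read counts are hypothetical; how the actual study counted reads, and how genes without any trans-splice leader reads were handled, is not described here, so skipping them is an assumption.

```python
def internal_genes_lacking_sl2(operon_genes, sl_counts, threshold=0.05):
    """Return internal operon genes with an SL2 read fraction below threshold.

    operon_genes: gene names in operon order; the first gene is not tested.
    sl_counts: gene name -> (sl2_reads, total_trans_splice_leader_reads).
    """
    flagged = []
    for gene in operon_genes[1:]:  # "internal" genes: all but the first
        sl2_reads, total_reads = sl_counts.get(gene, (0, 0))
        if total_reads == 0:
            # Assumption: no evidence either way, so leave for manual review.
            continue
        if sl2_reads / total_reads < threshold:
            flagged.append(gene)
    return flagged

# Hypothetical read counts for a three-gene operon.
operon = ["geneA", "geneB", "geneC"]
counts = {"geneA": (0, 100), "geneB": (90, 100), "geneC": (1, 100)}
candidates = internal_genes_lacking_sl2(operon, counts)
```

Genes returned by such a check are candidates for removal from the operon, subject to curator review.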

In addition to modENCODE, we continue to draw in data 
from the scientific literature and direct submissions, often com- 
bining different data sources to assist in making correct 
predictions. The modENCODE poly-A site data has been 
supplemented with a corresponding data set from an independent 
study. These two data sets have only 25% redundancy, and over 
80% of coding genes now have an annotated poly-A site in 
WormBase. Gene predictions by genBlastG based on BLAST 
homologies to C. elegans proteins have also proved valuable for 
the curation of C. briggsae, C. brenneri, and C. remanei. 

We can assess gene-model accuracy in the presence of 
fragmentary transcript evidence by measuring the proportion of 
curated introns that are confirmed by spliced cDNA evidence. 
For WS230, the proportion of C. elegans curated CDS introns 
confirmed by traditional cDNA, modENCODE RNASeq and 
mass-spectrometry evidence is 83%, 88% and 14% respectively. 
Overall, 93% of curated introns are confirmed and 82% of 
CDS models have all of their introns confirmed by at least one 
of these three lines of evidence; the corresponding measurements 
for the final release prior to modENCODE (WS200, February 
2009) were 74% and 56%, demonstrating the value of the pro- 
ject in increasing the accuracy and confidence of C. elegans gene 
models. 
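
The two measurements quoted above (the proportion of curated introns confirmed, and the proportion of CDS models with every intron confirmed) could be computed along the following lines. This is a minimal sketch under stated assumptions: introns are represented as hypothetical (sequence, start, end, strand) tuples, the confirmed set is the union of introns supported by any evidence type, and single-exon CDS models are excluded from the second denominator, which may differ from how the published figures were derived.

```python
def intron_confirmation(cds_introns, confirmed):
    """Return (fraction of introns confirmed, fraction of CDS fully confirmed).

    cds_introns: CDS name -> list of intron tuples (seq, start, end, strand).
    confirmed: set of intron tuples supported by at least one evidence type.
    """
    # Count each distinct intron once, even if shared between isoforms.
    all_introns = {i for introns in cds_introns.values() for i in introns}
    frac_introns = len(all_introns & confirmed) / len(all_introns)

    # Assumption: CDS models without introns are left out of the denominator.
    spliced = [introns for introns in cds_introns.values() if introns]
    fully = sum(all(i in confirmed for i in introns) for introns in spliced)
    frac_cds = fully / len(spliced)
    return frac_introns, frac_cds

# Hypothetical toy data: three CDS models, one of them single-exon.
i1, i2, i3 = ("I", 100, 200, "+"), ("I", 300, 400, "+"), ("II", 50, 90, "-")
models = {"cdsA": [i1, i2], "cdsB": [i3], "cdsC": []}
frac_introns, frac_cds = intron_confirmation(models, {i1, i3})
```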

Intraspecies Variation 

Similar to many other resources, WormBase captures within- 
species variation as differences (insertions, deletions and substitu- 
tions) with respect to the genome sequence of the reference 
strain. We expect variation data for many nematode species in 
the future, but at present almost all the data we house is for 
C. elegans. 

Historically, the majority of variation data we have processed 
has been from laboratory-manipulated strains. We maintain close 
working relationships and established data exchange protocols 
with the Caenorhabditis Genetics Center (CGC; 
www.cbs.umn.edu/CGC), the C. elegans Gene Knockout Consortium 
(GKC; www.celeganskoconsortium.omrf.org), and the National 
BioResource Project of Japan (NBRP; 
www.shigen.nig.ac.jp/c.elegans/index.jsp). We also curate variation data from individual 
user submissions, which, although time-consuming to curate, are often 
biologically important. 

There has recently been a rapid growth of C. elegans variation 
data generated by whole genome sequencing projects (refs. 50- 
54; Andersen et al., manuscript in preparation; Moerman and 
Waterston, manuscript in preparation). These data sets include 
an increasing number of variations from naturally-occurring 
wild-isolate strains. Motivated by community feedback, we have 
increased the clarity of our representation and display of this 
information. Every variation object processed by WormBase is 
assigned a unique, stable identifier with prefix "WBVar." For 
laboratory-induced variations, we also assign a more directly 
informative public name comprised of a project/laboratory prefix 
(supplied by J. Hodgkin, pers. comm.) and a numerical suffix. 
For naturally occurring variations, the public name defaults to 
the WBVar identifier, making the distinction between these 
objects and the laboratory induced variations obvious and 
immediate. 

We now also collect non-sequence-based information for wild 
isolate strains (http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/wild_isolate.cgi). 
Compared with laboratory-manipulated 
strains, there is additional information to capture about the wild 
isolates, such as isolation location, the condition in which it was 
found, and details of how it was isolated. Many wild isolates are 
not stocked at the CGC, and WormBase acts as the central data 
repository for these strains. 

WormBase does not have a mandate to act as a permanent 
repository for variation data, and as the volume of these data sets 
continues to rapidly increase, we become less adequately resourced 
to perform this function. Projects are therefore encouraged to 
submit their data to the NCBI's Database of Short Genetic 
Variations (dbSNP), an established archive for variation data. 
We act as a submission broker in cases where a laboratory lacks 
the technical resources to conform to the dbSNP submission 
protocols. To date, data from six projects have been integrated 
into WormBase and submitted to dbSNP. WormBase adds value 
to these data sets by performing additional analysis and placing 
them into context with other data types (e.g., Gene). 

Variations are most often submitted to WormBase as a 
molecular change at a given location in a specific version of the 
reference genome sequence. As part of the curation, we capture 
and record a short flanking sequence on either side of the variation 
feature, disassociating it from a specific version of the reference 
genome. Each release, we re-map all variations and re-calculate 
potential consequences of the molecular changes (e.g., non-sense, 
mis-sense or silent protein-coding mutation) on the latest gene 
models. 
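
The idea of anchoring a variation by its flanking sequences, rather than by coordinates, can be illustrated with a short Python sketch. It uses exact string matching on a single strand and requires a unique placement; the function name, the max_gap parameter and the toy sequences are assumptions for illustration, and a production re-mapping step would also need to handle the reverse strand, repeats and sequence changes within the flanks.

```python
def remap_by_flanks(genome, left_flank, right_flank, max_gap=50):
    """Locate the region between two flanks in a genome sequence.

    Returns 0-based half-open (start, end) coordinates of the sequence lying
    between the flanks, or None if the placement is missing or not unique.
    max_gap (hypothetical parameter) bounds the distance between the flanks.
    """
    hits = []
    pos = genome.find(left_flank)
    while pos != -1:
        start = pos + len(left_flank)
        # Look for the right flank beginning within max_gap bases of start.
        right = genome.find(right_flank, start,
                            start + max_gap + len(right_flank))
        if right != -1:
            hits.append((start, right))
        pos = genome.find(left_flank, pos + 1)
    return hits[0] if len(hits) == 1 else None

# Toy example: the same variation site before and after an upstream insertion.
old_genome = "CC" + "ACGTAC" + "G" + "TTGCAA" + "CC"
new_genome = "TTTTT" + old_genome
old_site = remap_by_flanks(old_genome, "ACGTAC", "TTGCAA")
new_site = remap_by_flanks(new_genome, "ACGTAC", "TTGCAA")
```

Because only the flanks are stored, the same record yields correct coordinates on both versions of the sequence, which is what allows every variation to be re-mapped at each release.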

Release Cycle and Database Build 

WormBase is released every two months, with the preparation 
for a release beginning three months in advance. This release 
cycle can give rise to variability in the time between a curator 
transaction (e.g., the update of a gene name, correction of an 
error, or the import of a new data set) and its availability on the 
WormBase website. The delay can be as short as three months 
(if the change is made immediately before we start building the 
release) and as long as five months (if made immediately after, in 
which case it will not be public until the following release). 
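
The bounds on this delay follow directly from the two figures given above, as the following small Python sketch shows. The constants come from the text; the function name and its single parameter are hypothetical.

```python
RELEASE_INTERVAL_MONTHS = 2  # a release every two months
BUILD_LEAD_MONTHS = 3        # preparation starts three months before release

def months_until_public(months_until_next_build_start):
    """Delay between a curator transaction and its public release.

    months_until_next_build_start: time from the transaction to the start of
    the next release build, between 0 and RELEASE_INTERVAL_MONTHS.
    """
    if not 0 <= months_until_next_build_start <= RELEASE_INTERVAL_MONTHS:
        raise ValueError("wait must lie within one release interval")
    # The change waits for the next build to start, then for that build.
    return months_until_next_build_start + BUILD_LEAD_MONTHS

shortest = months_until_public(0)                        # just before a build
longest = months_until_public(RELEASE_INTERVAL_MONTHS)   # just after a build
```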

Building a WormBase database release is a complicated process, 
the broad stages of which can be described as: (1) data freeze, 
where each contributing consortium partner takes a snap-shot 
of the database(s) in which their curation data are stored; (2) 
data collation, where the curation database snap-shots are 
brought together into a single database; (3) submission of 
updated annotation on core species to the International Nucleo- 
tide Sequence Database Collaboration,^^ to ensure that the 
representation of core nematode data in the nucleotide and 
protein archives is up-to-date; (4) mapping of sequence data (e.g., 
cDNAs, microarray probes, sequence features, variations) to the 
genome; (5) establishing connections between objects of different 
types (e.g., RNAi to Gene), usually via genomic location; (6) the 
large-scale computational analyses discussed earlier, such as 